While the contents of this report are considered to be correct,
they are subject to modification upon further study. This
report does not promulgate official Air Force policies or
positions. The technical conclusions are solely those of the
authors.


FOREWORD 


This  report  and  the  associated  software  were  prepared  by  the 
Modeling  Section  of  the  Air  Force  Manpower  and  Personnel  Center 
in  response  to  the  need  to  solve  large  multivariate  regression 
problems  (100  attributes  with  up  to  30,000  observations)  in 
the  management  of  the  one  million  plus  personnel  employed  by 
the  Air  Force.  The  collaboration  of  the  following  people  is 
acknowledged:  LCDR  C.  Pennington,  Lt  D.  Hemphill,  CMS  L.  Staton, 
and  TSG  D.  Francis. 


EXECUTIVE  SUMMARY 


Problem 

Air  Force  Personnel  managers  must  be  able  to  accurately 
forecast  the  force  size.  Taken  in  a  more  general  context, 
managers  must  be  able  to  forecast  a  system  response  to 
independent stimuli. Ordinary Least Squares (O.L.S.)
multivariate  regression  has  been  used  to  meet  this  need. 
O.L.S.  Regression  consumes  inordinate  ADP  resources. 

Objective 

Reduce  ADP  resource  usage  in  regression  studies. 

Approach 

Isolate the portion(s) of regression which account for the
greatest resource consumption and optimize them.

Results 


The  major  portion  (75-95  percent)  of  the  ADP  resource  was 
found  to  be  expended  in  the  computation  of  intercorrelation 
matrices.  Since  many  regression  problems  are  sparse  due  to 
the  use  of  dummy  variables,  the  introduction  of  logic  to 
omit zero attribute values within an observation has
provided  ADP  resource  savings  as  high  as  90  percent. 

Conclusions 


1. Regression problems of less than 91.5% density will yield
economies.

2.  Data  files  should  not  be  processed  with  general  utility 
regression  programs. 

3. As a second choice, process data files with stand-alone
correlation matrix builders.

4. The greatest economies will be realized if no data file
is created for regression analysis. Rather, insert correlation
matrix build logic in the ADP system at the point where
regression file(s) would be created.

5.  Any  of  several  matrix  input  regression  packages  may  then 
be  used  to  perform  the  actual  stepwise  or  multiple  regression. 


ABSTRACT 

Air  Force  personnel  managers  must  be  able  to  accurately 
forecast  the  force  size.  This  need  is  explicit  in  meeting 
statutory  budget  limitations.  Further,  officer  losses  drive 
accessions,  training,  and  promotion;  thus  the  need  for  accuracy 
in  forecasting  losses  cannot  be  over-emphasized.  To  accomplish 
this  objective,  loss  rates  have  been  generated  using  Ordinary 
Least  Squares  (OLS)  stepwise  regression,  run  on  what  are  locally 
dubbed  the  "binary  files".  The  purpose  of  this  paper  is  to 
report  a  front-end  processor  to  OLS  which  has  reduced  computer 
run  time  by  85  percent  for  this  organization. 


INTRODUCTION 


Numerous software packages are available to solve the
multivariate regression problem, to wit: SPSS Incorporated's
Statistical Package for the Social Sciences, the Biomedical
Division's BMD02R, Burroughs' Advanced Statistical Inquiry
System (BASIS), and Greenberger & Ward's Iterative Method.

These  methods  and  presumably  a  host  of  others  capitalize 
upon  the  ability  to  sequentially  consume  virtually  an  unlimited 
amount  of  data,  reduce  the  data  to  a  correlation  matrix,  and 
"solve  the  problem".  This  feature  of  sequential  processing 
of  data  into  a  relatively  small  core-resident  matrix  is  one 
of  the  main  selling  points  for  OLS  regression.  However,  the 
run  time  of  most  regression  packages  is  notorious.  In  fact, 
the literature, both proprietary and public, notes that providing
a  correlation  matrix  as  input  to  the  regression  program  will 
significantly  reduce  the  regression  run  time.  In  application 
this  approach  saves  time  only  if  the  user  wishes  to  run 
various  "sub-problems"  against  the  same  data  file.  (For  example 
one  may  wish  to  make  individual  runs  with  the  same  independent 
variables  against  two  or  more  dependent  variables.)  The  lack 
of  savings  observed  in  running  the  single  problem  results 
because  the  data  file  must  still  be  sequentially  processed  to 
build the correlation matrix. This leads to the incontrovertible
conclusion that, in order to make any money in reducing regression
run time, the correlation matrix generation logic must be attacked.
This, then, is our approach.


This  paper  reports  three  methods  which  have  resulted  in 
significant  computer  resource  savings  in  our  application  of 
O.L.S.  It  is  presumed  that  readers  of  this  paper  are  generally 
conversant  with  O.L.S.  methodology.  Therefore  the  objective 
of  this  paper  will  be  to  provide  techniques  which  may  provide 
reductions  in  run  time,  and  not  a  tutorial  on  regression  or 
correlation  derivation. 


METHOD 


General 

A straightforward method of doing regression is to sequentially
read the observations into an array, say VAR(I), where I varies
from 1 to the number of attributes, and then execute code
similar to the following:

    SUM(I) = SUM(I) + VAR(I)
    SSUM(I) = SSUM(I) + VAR(I) * VAR(I)
    XYSUM(I,J) = XYSUM(I,J) + VAR(I) * VAR(J)

        for I = 1 to number of attributes
            J = 1 to number of attributes

After processing all observations as above, compute the
means, standard deviations, and the intercorrelation matrix as
below, where N = the number of observations.


    XMEAN(I) = SUM(I) / N

    STDEV(I) = ((SSUM(I) - N * XMEAN(I) * XMEAN(I)) / (N-1)) ** 0.5

    R(I,J) = (XYSUM(I,J) - SUM(I) * SUM(J) / N)
             / ((N-1) * STDEV(I) * STDEV(J))
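The accumulation and the three closing formulas can be sketched in a few lines. The block below is a minimal Python illustration, not the report's FORTRAN; the function name and the four-row sample data are hypothetical, and it deliberately uses the unoptimized full I-by-J loop.

```python
import math

def correlation_from_sums(observations):
    """One pass over the data, then means, standard deviations, and R."""
    n = len(observations)
    k = len(observations[0])
    total = [0.0] * k                        # SUM(I)
    squares = [0.0] * k                      # SSUM(I)
    cross = [[0.0] * k for _ in range(k)]    # XYSUM(I,J)
    for var in observations:
        for i in range(k):
            total[i] += var[i]
            squares[i] += var[i] * var[i]
            for j in range(k):
                cross[i][j] += var[i] * var[j]
    mean = [s / n for s in total]
    stdev = [math.sqrt((squares[i] - n * mean[i] * mean[i]) / (n - 1))
             for i in range(k)]
    r = [[(cross[i][j] - total[i] * total[j] / n)
          / ((n - 1) * stdev[i] * stdev[j])
          for j in range(k)] for i in range(k)]
    return mean, stdev, r

# Hypothetical sample: two continuous attributes and one 0/1 dummy variable.
data = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0], [4.0, 8.0, 1.0]]
mean, stdev, r = correlation_from_sums(data)
```

Because the second attribute is exactly twice the first, their correlation comes out as 1 (up to rounding), which gives a quick sanity check on the formulas.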


The  above  produces  the  means  and  standard  deviations  for  the 
attributes  and  the  Pearson  Product-Moment  correlation  coefficients 
which  can  be  used  as  input  to  a  properly  designed  regression 
program.  But  at  what  cost? 


If the above logic were applied to a 100 attribute problem
with 3 thousand observations, the SUM(I) and SSUM(I)
computations would each be executed 300 thousand times, taking
approximately 15 seconds CPU time. The XYSUM(I,J) computation
would be executed 30 million times at a cost of about
1,100 seconds CPU time. All of this is in addition to
the time required to read the file and handle various other
statements required to complete the set of executable program
logic. (All timing estimates are for a Burroughs B6700.)

There are obvious savings in the XYSUM(I,J) computations, since
the XYSUM matrix is symmetric about the diagonal. Computing only
one triangle, diagonal included, automatically reduces the cost
by 49.5%, to 556 seconds of CPU time. Additionally, the diagonal
of the correlation matrix contains only 1's; therefore the XYSUM
on the diagonal is extraneous. This reduces the time required to
545 seconds CPU. These economies are recognized by most, but not
all, statistical packages which provide O.L.S. regression. There
are, however, potential problems with the approach above.
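The 556 and 545 second figures follow from counting XYSUM updates. The Python sketch below reproduces that arithmetic; it assumes CPU time scales linearly with the number of updates, which is an assumption of this illustration, and the function and layout names are hypothetical.

```python
def xysum_updates(attributes, observations, layout):
    """Number of XYSUM(I,J) updates for a given loop layout."""
    if layout == "full":
        per_observation = attributes * attributes
    elif layout == "triangle_with_diagonal":
        per_observation = attributes * (attributes + 1) // 2
    elif layout == "triangle_without_diagonal":
        per_observation = attributes * (attributes - 1) // 2
    else:
        raise ValueError(f"unknown layout: {layout}")
    return per_observation * observations

FULL_SECONDS = 1100.0  # the report's B6700 estimate for the full matrix

full = xysum_updates(100, 3000, "full")
estimates = {}
for layout in ("full", "triangle_with_diagonal", "triangle_without_diagonal"):
    updates = xysum_updates(100, 3000, layout)
    # Assumed: seconds are proportional to the update count.
    estimates[layout] = FULL_SECONDS * updates / full
    print(f"{layout}: {updates} updates, about {estimates[layout]:.1f} s")
```

With 100 attributes the triangle including the diagonal holds 5,050 of the 10,000 cells (50.5%), and the triangle without it holds 4,950 (49.5%), which round to the 556 and 545 seconds quoted above.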

Computers are limited in the number of significant digits that
can be represented in a real number, and in regression we
frequently deal with big numbers. Specifically, on the AFMPC
B6700 a real number contains 11 significant digits. Thus
truncation errors may develop if the sums, sums of squares,
or cross-product sums exceed 10 to the 11th power. Depending
upon circumstances, this could result in attempting to compute
the square root of a negative number or simply in erroneous
standard deviations and correlation elements. Since many
regression algorithms have not planned for this anomaly, the
researcher is cautioned to consider this possibility when
dealing with big numbers, particularly if the attempted
regression run terminates with "INVALID ALOG ARGUMENT".
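The effect can be reproduced on any machine by limiting the working precision. The Python sketch below uses the standard `decimal` module with 11 significant decimal digits as a stand-in for the B6700 real format; that substitution is an assumption made for illustration (the actual hardware representation differs), and the three sample values are hypothetical.

```python
from decimal import Decimal, localcontext

def one_pass_variance(values, digits):
    """Variance from SUM and SSUM, with every operation rounded to `digits`."""
    with localcontext() as ctx:
        ctx.prec = digits
        n = Decimal(len(values))
        total = Decimal(0)       # SUM
        squares = Decimal(0)     # SSUM
        for v in values:
            x = Decimal(v)
            total += x
            squares += x * x     # each square needs 13 digits here
        mean = total / n
        return (squares - n * mean * mean) / (n - 1)

values = [1000000, 1000001, 1000002]    # true variance is exactly 1
low = one_pass_variance(values, 11)     # 11 digits, as quoted for the B6700
high = one_pass_variance(values, 28)    # enough digits to be exact
print("11 digits:", low, " 28 digits:", high)
```

At 11 digits the low-order digits of each square are lost, so the subtraction cancels to zero and the computed variance is 0 instead of 1. With other data the same cancellation can go negative, which is how a square root of a negative number arises.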

Thus far we've reduced the computation of XYSUM by over one-half
for those statistical packages which do not recognize the
symmetry of the matrix. The next area, and the most important,
is the database itself. The data files we utilize daily are
created from raw data, the Master Personnel File (MPF). The raw
data is converted to what we call "binary files". Each
observation contains upwards of 100 variables or attributes.
Most, but not all, attributes are "1" or "0", e.g., either the
individual has a regular commission or he does not. Our new
composite binary file contains on/off state variables and
continuous variables, such as age. No statistical packages yet
observed recognize the economies of checking for a value of "0"
for an attribute before doing the SUM(I), SSUM(I), and
XYSUM(I,J) computations. Our point is: a "0" added to SUM(I)
leaves the original SUM(I), and a "0" squared and added to
SSUM(I) leaves the original SSUM(I), so why do them? By
inclusion of 2 lines of code (in most cases), we check to see
if the value is "0" and, if it is, increment the loop counter
by one and step back to process the next value instead of going
through all the calculations to end up with the same results.
We use the same logic for XYSUM(I,J). The following is a sample
of the code from our local regression package. The starred
lines are the added code.


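C     ACCUMULATE SUMS FOR ONE OBSERVATION HELD IN VAR(1..KK)
      DO 80 J=1,KK
C     A ZERO VAR(J) ADDS NOTHING TO ANY SUM, SO SKIP ITS WHOLE ROW
*     IF (VAR(J) .EQ. 0) GO TO 80
      SUM(J) = SUM(J) + VAR(J)
      SSUM(J) = SSUM(J) + VAR(J) * VAR(J)
C     LOWER TRIANGLE ONLY (K LESS THAN J), DIAGONAL OMITTED
      DO 110 K=1,J-1
*     IF (VAR(K) .EQ. 0) GO TO 110
      XYSUM(J,K) = XYSUM(J,K) + VAR(J) * VAR(K)
  110 CONTINUE
   80 CONTINUE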
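The same logic can be expressed as a short Python sketch and checked against the unconditional version. This is an illustration of the technique in the listing above, not the report's own code; the function names and the three sample observations are hypothetical.

```python
def accumulate_skipping_zeros(var, total, squares, cross):
    """Update the running sums for one observation, skipping zero values."""
    for j, vj in enumerate(var):
        if vj == 0:               # first starred test: skip the whole row
            continue
        total[j] += vj
        squares[j] += vj * vj
        row = cross[j]
        for k in range(j):        # lower triangle only, diagonal omitted
            vk = var[k]
            if vk == 0:           # second starred test: skip this product
                continue
            row[k] += vj * vk

def accumulate_dense(var, total, squares, cross):
    """Same update with no zero tests, for comparison."""
    for j, vj in enumerate(var):
        total[j] += vj
        squares[j] += vj * vj
        for k in range(j):
            cross[j][k] += vj * var[k]

def new_state(k):
    return [0.0] * k, [0.0] * k, [[0.0] * k for _ in range(k)]

# Hypothetical observations: three dummy variables plus age.
observations = [[1, 0, 0, 34.0], [0, 1, 0, 41.0], [1, 1, 0, 29.0]]
sparse_state = new_state(4)
dense_state = new_state(4)
for var in observations:
    accumulate_skipping_zeros(var, *sparse_state)
    accumulate_dense(var, *dense_state)
print(sparse_state == dense_state)
```

Both versions produce identical sums, since every skipped term would have contributed zero. The only difference is how much arithmetic is executed to get there.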

The added lines of code will cost us some CPU time. The overall
CPU run time reduction is clearly a function of the file density.
A test problem was constructed to see at what density the
modified regression code would break even with the original
regression run time. To accomplish this, the inner and outer DO
statements were timed in a variety of computer mixes on the
B6700.

To process a 10,000 observation problem with 100 variables, the
average run time was:

    outer computations =   56 seconds
    inner computations = 2200 seconds
    Total              = 2256 seconds

The worst case with the new regression code and an 80% dense
file was:

    outer computations =   45 seconds
    inner computations = 2216 seconds
    Total              = 2261 seconds

If  a  file  is  80%  or  less  dense  then  the  inclusion  of  the  extra 
lines  of  code  is  cost  effective. 
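Why a break-even density exists can be seen by counting the work for a single observation: each added test is paid for every cell the loops reach, while the multiply-add is paid only when both values are nonzero. The Python sketch below counts both; it is an illustration only, with hypothetical function names and a hypothetical 10-attribute observation, and it says nothing about the relative cost of a test versus a multiply-add on the B6700.

```python
def count_inner_work(var):
    """Count inner-loop zero tests and XYSUM updates for one observation."""
    tests = 0
    updates = 0
    for j, vj in enumerate(var):
        if vj == 0:
            continue              # whole inner loop skipped for this row
        for k in range(j):
            tests += 1            # cost of the added IF statement
            if var[k] != 0:
                updates += 1      # cost of the multiply-add
    return tests, updates

def dense_updates(attributes):
    """Updates done by the unmodified lower-triangle loop."""
    return attributes * (attributes - 1) // 2

observation = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]   # 40% dense
tests, updates = count_inner_work(observation)
print(tests, updates, dense_updates(len(observation)))
```

For this observation the modified loop performs 17 tests and 6 updates against 45 updates in the unmodified loop. As density rises the tests stop paying for themselves, which is consistent with the roughly 80% break-even point measured above.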


... 


Reducing CPU time enables drastic reductions in clock time.
Clock time represents a tie-up of computer resources and
personnel. Two actual files with 28% density are presented
below:

a.  File size:     2374 observations
    Attributes:    92 variables
    File density:  28%
    File format:   Formatted
    CPU run time:  1375 seconds versus 230 seconds
    Clock time:    1 hr, 9 min, 57 sec versus 6 min, 58 sec

b.  File size:     4197 observations
    Attributes:    94 variables
    File density:  28%
    File format:   Formatted
    CPU run time:  2676 seconds versus 322 seconds
    Clock time:    2 hr, 15 min, 55 sec versus 10 min, 46 sec

In example a., the CPU run time was reduced 83% while the clock
time was reduced 90%. In example b., the CPU run time was reduced
87% while the clock time was reduced 96%. This is a significant
saving, as the density of our average files is 28%. To restate
the objectives we had: we needed to run regressions with matrix
input and needed to speed up the matrix build process. The
savings in CPU and clock time are geometric. Once a parent
binary file is created capturing data at given points in time,
numerous regressions can be run on the file utilizing subsets
of variables to answer a variety of questions in the same or
less time than it took to run only one regression.
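Running sub-problems on subsets of variables is cheap because the correlation matrix for any subset can be taken from the sums already accumulated, without rereading the data. The Python sketch below assumes the sums from a single pass have been retained; the function name and the stored values (four hypothetical observations of three attributes) are illustrative only.

```python
import math

def subset_correlation(n, total, cross, subset):
    """Correlation matrix for the chosen attributes, from stored sums.

    total[i] is the sum of attribute i; cross[i][j] is the full symmetric
    cross-product sum, with the sums of squares on the diagonal.
    """
    def cov(i, j):
        return (cross[i][j] - total[i] * total[j] / n) / (n - 1)
    return [[cov(i, j) / math.sqrt(cov(i, i) * cov(j, j)) for j in subset]
            for i in subset]

# Hypothetical sums retained from one pass over four observations.
n = 4
total = [10.0, 20.0, 2.0]
cross = [[30.0, 60.0, 6.0],
         [60.0, 120.0, 12.0],
         [6.0, 12.0, 2.0]]

# Sub-problem using only the first and third attributes.
r_sub = subset_correlation(n, total, cross, [0, 2])
```

Each further sub-problem costs only a handful of arithmetic operations per matrix cell, regardless of how many observations went into the sums.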


Application


First and foremost, the user must avoid processing the raw
data with off-the-shelf, generalized software. And, if
reading the data file can be avoided altogether, even
better. Locally we have modified production regression
applications in both of the manners indicated.

In  the  first  case  our  regression  program  was  modified  at  the 
point  where  the  data  file  would  be  sequentially  consumed.  At 
this  point  a  procedure  is  invoked  which  processes  all  of  the 
raw  data  and  returns  the  intercorrelation  matrix,  the  means, 
and  the  standard  deviations  to  the  regression  program.  Using 
the  procedure  shown  below  in  figure  #1,  we  have  realized 
savings  of  75-85  percent  in  CPU  and  in  excess  of  90  percent 
in  elapsed  time  processing  data  files  which  average  28  percent 
density.  This  specific  application  has  been  generalized  to 
produce  a  utility  procedure  which  may  be  invoked  by  any  calling 
program requiring the means, standard deviations, and a
correlation matrix. Appendix A presents the utility procedure in
ALGOL.  As  can  be  readily  seen  from  the  program  documentation, 
this  procedure  may  be  included  in  a  calling  program  to  process 
observations  as  sequentially  read,  or  introduced  earlier  in  a 
system.  It  is  this  latter  application  which  provides  the 
greatest  efficiency. 


Our  second  application  of  these  enhancements  to  regression  has 
been  accomplished  by  early  introduction  into  an  existing  system 
of  software  similar  to  that  above,  in  figure  #1.  In  this  case 
a  system  in  which  two  files  were  merged  produced  a  "binary" 
file  for  subsequent  processing  in  a  stand-alone  regression 
package.  In  the  system  as  it  previously  operated,  the  two 
files  were  merged  and  written  to  diskpack  for  later  regression 
analysis.  At  this  point  program  logic  was  added  to  the  "merge" 
program  to  compute  the  necessary  data  for  producing  the  means, 
standard deviations, and correlation matrix. This saves
the ADP time to write and read a file, plus the storage medium
to store data between processing steps. Resource usage was
markedly reduced.
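The idea of accumulating inside the merge step can be sketched as follows. This is a Python illustration under stated assumptions, not the Appendix B code: the class, the function, the record keys, and the two small input files are all hypothetical, and the two inputs are modeled as dictionaries keyed by a record identifier.

```python
class CorrelationSums:
    """Running sums kept in memory while records are being merged."""

    def __init__(self, attributes):
        self.n = 0
        self.total = [0.0] * attributes
        self.cross = [[0.0] * attributes for _ in range(attributes)]

    def add(self, var):
        self.n += 1
        for j, vj in enumerate(var):
            if vj == 0:                 # skip zero attribute values
                continue
            self.total[j] += vj
            row = self.cross[j]
            for k in range(j + 1):      # diagonal holds the sum of squares
                if var[k] != 0:
                    row[k] += vj * var[k]

def merge_and_accumulate(left, right, sums):
    """Join matching records and feed each merged observation to `sums`.

    No merged file is written; the observation is consumed immediately.
    """
    for key in sorted(left.keys() & right.keys()):
        sums.add(left[key] + right[key])

# Hypothetical inputs: two dummy variables in one file, a dummy and age in the other.
personnel = {101: [1, 0], 102: [0, 1], 103: [1, 1]}
history = {101: [0, 34.0], 103: [1, 41.0], 104: [0, 29.0]}

sums = CorrelationSums(attributes=4)
merge_and_accumulate(personnel, history, sums)
print(sums.n, sums.total)
```

Only records present in both inputs (101 and 103 here) are merged. The sums can then be handed to any matrix-input regression routine, and the intermediate file is never written or reread.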

The implementation in the second case realized the greatest
reduction in computer resources. Regression problems which had
run in 1400 CPU seconds are now processed in 120 CPU seconds.
In addition, DASD storage can be reduced.

Appendix B presents the HOL source code illustrating one way
to insert the time-consuming portion of regression into an
extant ADP application.


CONCLUSION/RECOMMENDATION

OLS regression requires high CPU resources.

Applied  regression  typically  produces  low  density  files. 

That is, introduction of dummy variables to accommodate
nonlinearity results in lots of zeros.

Files  of  less  than  80%  density  can  be  processed  more  rapidly 
by  improved  logic. 

Don't  use  off-the-shelf  generalized  regression  packages  to 
build  correlation  matrices. 

As a second choice, use a stand-alone correlation matrix
generator.

As a first choice, hardwire correlation matrix logic into
existing systems at the point where the data files for
regression analyses are now produced.